02 June 2014
"The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill in the next decades, … because now we really do have essentially free and ubiquitous data. So the complementary scarce factor is the ability to understand that data and extract value from it."
Hal Varian, Google’s Chief Economist
“What information consumes is rather obvious: it consumes the attention of its recipients. Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.”
Herb Simon
Pythagoras' theorem is a relation in Euclidean geometry among the three sides of a right triangle. It states that the square of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the other two sides.
For every \(\triangle XYZ\) with \(\angle XYZ = 90^\circ\) and side lengths \(XY = a\), \(YZ = b\) and \(XZ = c\), the following relationship holds:
\(a^2 + b^2 = c^2\)
The First Six Books of the Elements of Euclid by Oliver Byrne
Simple
Complex
Derived from the Latin verb videre, "to look, to see"
"The act or instance to form a mental image or picture (without an object) or… the act or instance to make visible or visual (with an object)"
“Transformation of the symbolic into the geometric” - McCormick et al. 1987
“The use of computer-generated, interactive, visual representations of abstract data to amplify cognition.” - Card, Mackinlay, & Shneiderman 1999
Expression | Decorative - Data Art for visual expression, delight (and impact)
Flight Patterns, Internet Census
Exploration | Interactive - Data Tool for engagement, exploration and discovery
Cricket Batting, Working Capital Profiler
Explanation | Narrative - Data Stories for telling a specific and (mostly linear) visual narrative
The Joy of Stats, Wealth Inequality, Out of Sight, Out of Mind
Graph of rate of evaporation of water vs. temperature
Book: Sémiologie graphique / Semiology of Graphics
Visual language is a sign language
“… finding the artificial memory that best supports our natural means of perception.”
∴ Encode quantitative variables
“Resemblance, order and proportion are the three signifieds in graphics.” - Bertin
anscombe
##    x1 x2 x3 x4    y1   y2    y3    y4
## 1  10 10 10  8  8.04 9.14  7.46  6.58
## 2   8  8  8  8  6.95 8.14  6.77  5.76
## 3  13 13 13  8  7.58 8.74 12.74  7.71
## 4   9  9  9  8  8.81 8.77  7.11  8.84
## 5  11 11 11  8  8.33 9.26  7.81  8.47
## 6  14 14 14  8  9.96 8.10  8.84  7.04
## 7   6  6  6  8  7.24 6.13  6.08  5.25
## 8   4  4  4 19  4.26 3.10  5.39 12.50
## 9  12 12 12  8 10.84 9.13  8.15  5.56
## 10  7  7  7  8  4.82 7.26  6.42  7.91
## 11  5  5  5  8  5.68 4.74  5.73  6.89
Mean
\(\mu_x = 9\); \(\mu_y = 7.5\)
Variance and Correlation
\(\sigma^2_x = 11\); \(\sigma^2_y = 4.1\); \(cor(x,y) = 0.816\)
Linear Regression
\(y = 3.00 + 0.500x\)
\(R^2 = 0.667\)
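The summary figures above can be checked against R's built-in anscombe data frame. A minimal sketch, assuming the slide's x and y stand for each of the four x/y pairs in turn; the names stats and set1 to set4 are illustrative only:

```r
# Summary statistics for each of the four x/y pairs in the built-in anscombe data
stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)                      # simple linear regression of y on x
  c(mean_x    = mean(x),
    mean_y    = mean(y),
    var_x     = var(x),                 # sample variance
    var_y     = var(y),
    cor_xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2]),
    r_squared = summary(fit)$r.squared)
})
colnames(stats) <- paste0("set", 1:4)
round(stats, 2)
```

The four columns should agree to about two decimal places, which is why the numbers alone cannot tell the four data sets apart.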
Exploratory Data Analysis: An approach to analyze data sets to summarize their main characteristics, often with visual methods
Book: The Visual Display of Quantitative Information
“Above all else, show the data.”
Data-Ink ratio = data-ink / total-ink used in graphics
Don't lie with statistics
Book: The Elements of Graphing Data
"The important criterion for a graph is not simply how fast we can see a result; rather it is whether through the use of the graph we can see something that would have been harder to see otherwise or that could not have been seen at all."
| Aspects | Macintosh | MacBook | Change |
|---|---|---|---|
| Year | 1984 | 2014 | +30 |
| Cost | $2,500 | $999 | 2/5x |
| Speed | 8MHz | 1.4GHz | 175x |
| Memory | 128KB | 4GB | 30,000x |
| Pixels | 512 x 342 | 1440 x 900 | 7.4x |
| Screen | 72PPI (9in) | 128PPI (13.3in) | 1.8x |
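The Change column can be reproduced with a few divisions. A rough sketch using only the figures in the table; it assumes binary units for memory (1 KB = 2^10 bytes, 1 GB = 2^30 bytes), and the variable names are illustrative:

```r
# Values copied from the table above (1984 Macintosh vs. 2014 MacBook)
cost   <- c(mac1984 = 2500,       macbook2014 = 999)         # US dollars
speed  <- c(mac1984 = 8e6,        macbook2014 = 1.4e9)       # Hz
memory <- c(mac1984 = 128 * 2^10, macbook2014 = 4 * 2^30)    # bytes, binary units assumed
pixels <- c(mac1984 = 512 * 342,  macbook2014 = 1440 * 900)  # total pixel count
ppi    <- c(mac1984 = 72,         macbook2014 = 128)         # pixels per inch

# Ratio of the 2014 value to the 1984 value
ratio <- function(v) unname(v["macbook2014"] / v["mac1984"])

changes <- c(cost = ratio(cost), speed = ratio(speed), memory = ratio(memory),
             pixels = ratio(pixels), ppi = ratio(ppi))
signif(changes, 3)
```

With binary units the memory ratio comes to 32,768, which the table rounds to 30,000x.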
Book: The Grammar of Graphics
Grammar: “the fundamental principles or rules of an art or science”
"…rules for constructing graphs mathematically and then representing them as graphics aesthetically."
Three metaphors for thinking about visualization
Base graphics: Written by Ross Ihaka, drawing on experience with S graphics. It follows a pen-on-paper model, with no user-accessible representation of the graphic. Base graphics functions are generally fast but limited in scope.
grid graphics: Developed by Paul Murrell (2000). Grid grobs (graphical objects) can be represented independently of the plot and modified later. Grid provides drawing primitives but no tools for producing statistical graphics.
lattice: Developed by Deepayan Sarkar (2008). It uses grid graphics to implement Cleveland's trellis graphics system. Conditioned plots are easy to produce, but it lacks a formal model.
ggplot2: Developed by Hadley Wickham (2007). It combines the strengths of lattice with an underlying layered grammar of graphics, making it easy to draw a wide range of graphics with compact syntax and independent components.
install.packages('ggplot2')
library(ggplot2)
Main arguments
General ggplot syntax
ggplot(data, aes(…)) + geom_x() + … + stat_x() + …
Layer specifications
Additional components: scales, coordinates, facet
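One possible instance of the general syntax, spelling out each component. This is a sketch only: the choice of variables, the log scale and the linear smoother are illustrative assumptions, not a recommended plot.

```r
library(ggplot2)

# diamonds ships with ggplot2
p <- ggplot(diamonds, aes(x = carat, y = price)) +  # data + aesthetic mappings
  geom_point(alpha = 0.05) +                        # layer 1: a geom
  geom_smooth(method = "lm", se = FALSE) +          # layer 2: a statistical summary
  scale_y_log10() +                                 # scale
  facet_wrap(~ cut) +                               # facets
  coord_cartesian()                                 # coordinate system (the default)
p
```

Each component is added with `+`, so any one of them can be swapped or removed without touching the others.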
data(diamonds)
names(diamonds)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
##  [8] "x"       "y"       "z"
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
?diamonds
A data frame with 53940 rows and 10 variables
str(diamonds)
## 'data.frame': 53940 obs. of 10 variables:
##  $ carat  : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Categorical
Bar, Stacked, CoxComb, Pie, Bulls-eye
Continuous Variables
Histogram, BoxPlot
summary(diamonds$clarity)
##    I1   SI2   SI1   VS2   VS1  VVS2  VVS1    IF
##   741  9194 13065 12258  8171  5066  3655  1790
ggplot(diamonds, aes(clarity)) + geom_bar()
ggplot(diamonds, aes(clarity, fill=clarity)) + geom_bar()
ggplot(diamonds, aes(clarity, fill=clarity)) + geom_bar(width = 1)
ggplot(diamonds, aes(clarity, fill=clarity)) + geom_bar(width = 1) + coord_polar()
ggplot(diamonds, aes(x="", fill=clarity)) + geom_bar()
ggplot(diamonds, aes(x= "", fill=clarity)) + geom_bar() + coord_polar(theta = "y")
ggplot(diamonds, aes(x= "", fill=clarity)) + geom_bar(width = 1) + coord_polar(theta = "x")
summary(diamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     950    2400    3930    5320   18800
ggplot(diamonds, aes(price)) + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
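The default reported in the message above can be worked out by hand. A small sketch using the rounded minimum and maximum from summary(diamonds$price), so the result is approximate; recent ggplot2 releases phrase the same default as `bins = 30`:

```r
# Price range as reported (rounded) by summary(diamonds$price)
price_min <- 326
price_max <- 18800

# Default bin width: the range split into 30 equal bins
default_binwidth <- (price_max - price_min) / 30
default_binwidth   # roughly 616

# Approximate number of bins implied by explicit widths of 500 and 50
ceiling((price_max - price_min) / c(500, 50))
```

A width of 500 gives about the same resolution as the default, while 50 gives roughly ten times as many bins and exposes finer structure in the price distribution.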
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 500)
ggplot(diamonds, aes(price)) + geom_histogram(binwidth = 50)
ggplot(diamonds, aes(price, fill=..count..)) + geom_histogram(binwidth = 50)
ggplot(diamonds, aes("", price)) +
geom_boxplot()
ggplot(diamonds, aes(x="", price)) + geom_boxplot() + coord_flip()
Categorical vs. Categorical
Stacked Bar, Mosaic
Continuous vs. Categorical
Histogram - Aesthetics, Facets, Frequency Polygon, Density
Continuous vs. Continuous
Scatterplot - Aesthetics, Facets
by(diamonds$cut, diamonds$clarity, summary)
## diamonds$clarity: I1
##      Fair      Good Very Good   Premium     Ideal
##       210        96        84       205       146
## --------------------------------------------------------
## diamonds$clarity: SI2
##      Fair      Good Very Good   Premium     Ideal
##       466      1081      2100      2949      2598
## --------------------------------------------------------
## diamonds$clarity: SI1
##      Fair      Good Very Good   Premium     Ideal
##       408      1560      3240      3575      4282
## --------------------------------------------------------
## diamonds$clarity: VS2
##      Fair      Good Very Good   Premium     Ideal
##       261       978      2591      3357      5071
## --------------------------------------------------------
## diamonds$clarity: VS1
##      Fair      Good Very Good   Premium     Ideal
##       170       648      1775      1989      3589
## --------------------------------------------------------
## diamonds$clarity: VVS2
##      Fair      Good Very Good   Premium     Ideal
##        69       286      1235       870      2606
## --------------------------------------------------------
## diamonds$clarity: VVS1
##      Fair      Good Very Good   Premium     Ideal
##        17       186       789       616      2047
## --------------------------------------------------------
## diamonds$clarity: IF
##      Fair      Good Very Good   Premium     Ideal
##         9        71       268       230      1212
ggplot(diamonds, aes(x=cut, fill=clarity)) + geom_bar()
ggplot(diamonds, aes(x=cut, fill=clarity)) + geom_bar(position = "dodge")
ggplot(diamonds, aes(x=cut, fill=clarity)) + geom_bar(position = "fill")
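With position = "fill" every bar is rescaled to a height of 1, so each segment shows the share of a clarity grade within its cut. A minimal sketch of the same numbers as a table; the name clarity_by_cut is illustrative:

```r
library(ggplot2)  # provides the diamonds data set

# Share of each clarity grade within each cut (each row sums to 1)
clarity_by_cut <- prop.table(table(diamonds$cut, diamonds$clarity), margin = 1)
round(clarity_by_cut, 3)
```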
No direct function - But you can easily write it
ggMMplot <- function(var1, var2){
  library(ggplot2)
  levVar1 <- length(levels(var1))
  # Joint proportions of every var1/var2 combination (var1 varies fastest)
  jointTable <- prop.table(table(var1, var2))
  plotData <- as.data.frame(jointTable)
  # Marginal share of each var1 level, recycled across the var2 levels
  plotData$marginVar1 <- as.numeric(prop.table(table(var1)))
  # Bar height: share of var2 within each var1 level
  plotData$var2Height <- plotData$Freq / plotData$marginVar1
  # Bar centre: total width of the preceding var1 levels plus half of its own width
  plotData$var1Center <- c(0, cumsum(plotData$marginVar1)[seq_len(levVar1 - 1)]) +
    plotData$marginVar1 / 2
  ggplot(plotData, aes(var1Center, var2Height)) +
    geom_bar(stat = "identity", aes(width = marginVar1, fill = var2), col = "Black") +
    geom_text(aes(label = as.character(var1), x = var1Center, y = 1.05))
}
ggMMplot(diamonds$cut, diamonds$clarity)
## Warning: position_stack requires constant width: output may be incorrect
by(diamonds$price, diamonds$cut, summary)
## diamonds$cut: Fair
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     337    2050    3280    4360    5210   18600
## --------------------------------------------------------
## diamonds$cut: Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     327    1140    3050    3930    5030   18800
## --------------------------------------------------------
## diamonds$cut: Very Good
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     336     912    2650    3980    5370   18800
## --------------------------------------------------------
## diamonds$cut: Premium
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326    1050    3180    4580    6300   18800
## --------------------------------------------------------
## diamonds$cut: Ideal
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     326     878    1810    3460    4680   18800
ggplot(diamonds, aes(price, fill=cut)) + geom_histogram(binwidth = 500)
ggplot(diamonds, aes(price, fill=cut)) + geom_histogram(binwidth = 500) + facet_wrap(~ cut)
ggplot(diamonds, aes(price, fill=cut)) + geom_histogram(binwidth = 500) + facet_wrap(~ cut, scales="free")
ggplot(diamonds, aes(price, color = cut)) + geom_freqpoly(binwidth = 500)
ggplot(diamonds, aes(price, ..density.., color=cut)) + geom_freqpoly(binwidth = 500)
ggplot(diamonds, aes(price, ..density.., fill=cut)) + geom_histogram(binwidth = 500) + facet_wrap(~ cut)
ggplot(diamonds, aes(cut, price, color = cut)) + geom_point()
ggplot(diamonds, aes(cut, price, color = cut)) + geom_jitter()
ggplot(data = diamonds, aes(carat, price)) + geom_point()
ggplot(data = diamonds, aes(carat, price)) + geom_point(size = 1)
ggplot(data = diamonds, aes(carat, price)) +
geom_point(alpha = I(1/20))
ggplot(data = diamonds, aes(carat, price)) +
geom_jitter()
ggplot(diamonds, aes(carat, price)) +
geom_point() + geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.
ggplot(diamonds, aes(carat, price)) + xlim(c(0, 3.1)) + geom_point()
## Warning: Removed 14 rows containing missing values (geom_point).
ggplot(diamonds, aes(carat, price)) + scale_y_log10() + geom_point()
ggplot(diamonds, aes(carat, price, color=cut)) + geom_point()
ggplot(diamonds, aes(carat, price, color=cut)) + geom_point(size=1) + facet_wrap(~ cut)
## troops <- read.table(url("http://amitkaps.com/data/minard-troops.txt"), header = TRUE)
troops <- read.table("minard-troops.txt", header = TRUE)
troops
##    long  lat survivors direction group
## 1  24.0 54.9    340000         A     1
## 2  24.5 55.0    340000         A     1
## 3  25.5 54.5    340000         A     1
## 4  26.0 54.7    320000         A     1
## 5  27.0 54.8    300000         A     1
## 6  28.0 54.9    280000         A     1
## 7  28.5 55.0    240000         A     1
## 8  29.0 55.1    210000         A     1
## 9  30.0 55.2    180000         A     1
## 10 30.3 55.3    175000         A     1
## 11 32.0 54.8    145000         A     1
## 12 33.2 54.9    140000         A     1
## 13 34.4 55.5    127100         A     1
## 14 35.5 55.4    100000         A     1
## 15 36.0 55.5    100000         A     1
## 16 37.6 55.8    100000         A     1
## 17 37.7 55.7    100000         R     1
## 18 37.5 55.7     98000         R     1
## 19 37.0 55.0     97000         R     1
## 20 36.8 55.0     96000         R     1
## 21 35.4 55.3     87000         R     1
## 22 34.3 55.2     55000         R     1
## 23 33.3 54.8     37000         R     1
## 24 32.0 54.6     24000         R     1
## 25 30.4 54.4     20000         R     1
## 26 29.2 54.3     20000         R     1
## 27 28.5 54.2     20000         R     1
## 28 28.3 54.3     20000         R     1
## 29 27.5 54.5     20000         R     1
## 30 26.8 54.3     12000         R     1
## 31 26.4 54.4     14000         R     1
## 32 25.0 54.4      8000         R     1
## 33 24.4 54.4      4000         R     1
## 34 24.2 54.4      4000         R     1
## 35 24.1 54.4      4000         R     1
## 36 24.0 55.1     60000         A     2
## 37 24.5 55.2     60000         A     2
## 38 25.5 54.7     60000         A     2
## 39 26.6 55.7     40000         A     2
## 40 27.4 55.6     33000         A     2
## 41 28.7 55.5     33000         A     2
## 42 28.7 55.5     33000         R     2
## 43 29.2 54.2     30000         R     2
## 44 28.5 54.1     30000         R     2
## 45 28.3 54.2     28000         R     2
## 46 24.0 55.2     22000         A     3
## 47 24.5 55.3     22000         A     3
## 48 24.6 55.8      6000         A     3
## 49 24.6 55.8      6000         R     3
## 50 24.2 54.4      6000         R     3
## 51 24.1 54.4      6000         R     3
plot_troops <- ggplot(troops, aes(long, lat)) +
  geom_path(aes(size = survivors, color = direction, group = group))
plot_troops
## cities <- read.table(url("http://amitkaps.com/data/minard-cities.txt"), header = TRUE)
cities <- read.table("minard-cities.txt", header = TRUE)
cities
##    long  lat           city
## 1  24.0 55.0          Kowno
## 2  25.3 54.7          Wilna
## 3  26.4 54.4       Smorgoni
## 4  26.8 54.3      Moiodexno
## 5  27.7 55.2      Gloubokoe
## 6  27.6 53.9          Minsk
## 7  28.5 54.3     Studienska
## 8  28.7 55.5        Polotzk
## 9  29.2 54.4           Bobr
## 10 30.2 55.3        Witebsk
## 11 30.4 54.5         Orscha
## 12 30.4 53.9        Mohilow
## 13 32.0 54.8       Smolensk
## 14 33.2 54.9    Dorogobouge
## 15 34.3 55.2          Wixma
## 16 34.4 55.5          Chjat
## 17 36.0 55.5        Mojaisk
## 18 37.6 55.8         Moscou
## 19 36.6 55.3      Tarantino
## 20 36.5 55.0 Malo-Jarosewii
plot_troops_cities <- plot_troops +
  geom_text(aes(label = city), size = 4, data = cities)
plot_troops_cities
library(maps)
library(mapproj)
plot_polished <- plot_troops_cities +
scale_size(range = c(1, 10),
breaks = c(1, 2, 3) * 10^5,
labels = c(1, 2, 3) * 10^5 )+
scale_color_manual(values = c("grey50","red")) +
xlab(NULL) +
ylab(NULL) +
coord_map()
plot_polished
Resources & Books
Courses
You are working as a team member in a large global project to develop the digital ad strategy for your company. As part of the project, you need to provide an overview of the computing devices the consumers are likely to use to interact with these digital ads.
You have received a spreadsheet from an analyst about these computing devices. The devices are tracked in three main categories: PCs (including desktops and laptops), Tablets and Smartphones. The data sheet includes historical and forecasted data on shipments (devices shipped to consumers) and installed base (devices in use by consumers) for these computing devices. In addition, you have the same data segmented by the Operating System (OS) used on each of these devices.
The data sheet by the analyst is available at http://goo.gl/Zy6lcR
You need to develop a short data visualization for this data set and problem statement (using your preferred visualization and presentation tool). Please use the data shared by the analyst, though you are free to enrich it with additional data or insights from external sources.
You will have 5 minutes to share this overview with the global project team as part of the next project discussion. Please prepare the visualizations accordingly.
Amit Kapoor
Partner, narrativeVIZ Consulting